.. _`Transform Table Dataset (Experimental)`:

.. _`com.sympathyfordata.advancedmachinelearning.transformtabledataset`:

Transform Table Dataset (Experimental)
``````````````````````````````````````

.. image:: table_ds_transform.svg
   :width: 48

Transforms a tabular dataset using common preprocessing operations.

Documentation
:::::::::::::

Algorithms
==========

**Binarizer**

Binarize data (set feature values to 0 or 1) according to a threshold.

:threshold:
    Feature values less than or equal to the threshold are replaced by 0,
    values above it by 1. The threshold may not be less than 0 for
    operations on sparse matrices.

**LabelEncoder**

Encode target labels with values between 0 and n_classes-1.

:Use categorical:

**OneHotEncoder**

Encode categorical features as a one-hot numeric array.

:Handle Unknown:
    Whether to raise an error or to ignore an unknown categorical feature
    present during transform (the default is to raise). When this option
    is set to 'ignore' and an unknown category is encountered during
    transform, the resulting one-hot encoded columns for this feature will
    be all zeros. In the inverse transform, an unknown category will be
    denoted as None.
:Desired data type:
    Desired dtype of the output.
:Transformed array in sparse format:
    If True, return a sparse matrix; otherwise return an array.
:Categories:
    Categories (unique values) per feature:

    'auto' : determine categories automatically from the training data.

    list : categories[i] holds the categories expected in the ith column.
    The passed categories should not mix strings and numeric values within
    a single feature, and should be sorted in case of numeric values.

    The used categories can be found in the ``categories_`` attribute.
:Drop category:
    Specifies a methodology to use to drop one of the categories per
    feature. This is useful in situations where perfectly collinear
    features cause problems, such as when feeding the resulting data into
    a neural network or an unregularized regression.
    However, dropping one category breaks the symmetry of the original
    representation and can therefore induce a bias in downstream models,
    for instance in penalized linear classification or regression models.

    None : retain all features (the default).

    'first' : drop the first category in each feature. If only one
    category is present, the feature will be dropped entirely.

    'if_binary' : drop the first category in each feature with two
    categories. Features with 1 or more than 2 categories are left intact.

    array : drop[i] is the category in feature X[:, i] that should be
    dropped.

**PolynomialFeatures**

Generate polynomial and interaction features.

:Only interaction features produced:
    If True, only interaction features are produced: features that are
    products of at most ``degree`` distinct input features (so not
    x[1] ** 2, x[0] * x[2] ** 3, etc.).
:Include bias:
    If True (the default), include a bias column: the feature in which
    all polynomial powers are zero (i.e. a column of ones, which acts as
    an intercept term in a linear model).
:Degree:
    The degree of the polynomial features.
:Preserve as dataframe:
    Preserve the output as a Dask dataframe.
:Order:
    Order of the output array in the dense case. 'F' order is faster to
    compute, but may slow down subsequent estimators.

**RobustScaler**

Scale features using statistics that are robust to outliers.

:Scale to interquartile range:
    If True, scale the data to the interquartile range.
:Copy:
    If False, try to avoid a copy and scale in place instead. This is not
    guaranteed to always work in place; e.g. if the data is not a NumPy
    array or scipy.sparse CSR matrix, a copy may still be returned.
:IQR Quantile range - Lower:
    Lower bound of the quantile range used to calculate ``scale_``.
:IQR Quantile range - Upper:
    Upper bound of the quantile range used to calculate ``scale_``.
:Center the data before scaling:
    If True, center the data before scaling.
    This will cause transform to raise an exception when attempted on
    sparse matrices, because centering them entails building a dense
    matrix which in common use cases is likely to be too large to fit in
    memory.

**SimpleImputer**

Simple imputation for missing data in tabular datasets.

:Add indicator:
    If True, a MissingIndicator transform will stack onto the output of
    the imputer's transform. This allows a predictive estimator to
    account for missingness despite imputation. If a feature has no
    missing values at fit/train time, the feature won't appear on the
    missing indicator even if there are missing values at transform/test
    time.
:Fill value for missing values:
    When strategy == "constant", fill_value is used to replace all
    occurrences of missing_values. If left at the default, fill_value
    will be 0 when imputing numerical data and "missing_value" for string
    or object data types.
:Copy:
    If True, a copy of X will be created. If False, imputation will be
    done in place whenever possible. Note that in the following cases a
    new copy will always be made, even if copy=False: if X is not an
    array of floating values; if X is encoded as a CSR matrix; if
    add_indicator=True.
:Missing values:
    The placeholder for the missing values. All occurrences of
    missing_values will be imputed. For pandas dataframes with nullable
    integer dtypes with missing values, missing_values should be set to
    np.nan, since pd.NA will be converted to np.nan.
:Strategy for missing values:
    The imputation strategy.

    If "mean", replace missing values using the mean along each column.
    Can only be used with numeric data.

    If "median", replace missing values using the median along each
    column. Can only be used with numeric data.

    If "most_frequent", replace missing values using the most frequent
    value along each column. Can be used with strings or numeric data. If
    there is more than one such value, only the smallest is returned.

    If "constant", replace missing values with fill_value.
    Can be used with strings or numeric data.
:Verbose:
    Controls the verbosity of the imputer.

**StandardScaler**

Standardize features by removing the mean and scaling to unit variance.

:Scale to unit variance:
    If True, scale the data to unit variance (or equivalently, unit
    standard deviation).
:Copy:
    If False, try to avoid a copy and scale in place instead. This is not
    guaranteed to always work in place; e.g. if the data is not a NumPy
    array or scipy.sparse CSR matrix, a copy may still be returned.
:Center the data:
    If True, center the data before scaling. This does not work (and will
    raise an exception) when attempted on sparse matrices, because
    centering them entails building a dense matrix which in common use
    cases is likely to be too large to fit in memory.

Definition
::::::::::

Input ports
===========

**dataset** dataset
    Dataset

Output ports
============

**dataset** dataset
    Dataset

Configuration
=============

**Add indicator**
    (Add indicator) (no description)
**Categories**
    (Categories) (no description)
**Center the data**
    (Center the data) (no description)
**Center the data before scaling**
    (Center the data before scaling) (no description)
**Copy**
    (Copy) (no description)
**Degree**
    (Degree) (no description)
**Desired data type**
    (Desired data type) (no description)
**Drop category**
    (Drop category) (no description)
**Fill value for missing values**
    (Fill value for missing values) (no description)
**Handle Unknown**
    (Handle Unknown) (no description)
**IQR Quantile range - Lower**
    (IQR Quantile range - Lower) (no description)
**IQR Quantile range - Upper**
    (IQR Quantile range - Upper) (no description)
**Include bias**
    (Include bias) (no description)
**Missing values**
    (Missing values) (no description)
**Negative Label**
    (Negative Label) (no description)
**Only interaction features produced**
    (Only interaction features produced) (no description)
**Order**
    (Order) (no description)
**Positive Label**
    (Positive Label) (no description)
**Preserve as dataframe**
    (Preserve as dataframe) (no description)
**Scale to interquartile range**
    (Scale to interquartile range) (no description)
**Scale to unit variance**
    (Scale to unit variance) (no description)
**Strategy for missing values**
    (Strategy for missing values) (no description)
**Transformed array in sparse format**
    (Transformed array in sparse format) (no description)
**Use categorical**
    (Use categorical) (no description)
**Verbose**
    (Verbose) (no description)
**Algorithm**
    (algorithm) (no description)
**Columns**
    (columns) Columns that should be converted.
**maximum_categories**
    (maximum_categories) (no description)
**norm**
    (norm) (no description)
**threshold**
    (threshold) (no description)

Implementation
==============

.. automodule:: node_transformdataset
    :noindex:

.. class:: TransformTableDataset
    :noindex:
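The algorithm options documented above mirror the corresponding scikit-learn preprocessing classes. As an illustrative sketch only (this node is configured through its parameter dialog, not by calling scikit-learn directly, so the direct API calls below are an assumption about the underlying behavior), the following shows how the Binarizer threshold, OneHotEncoder unknown-category handling, SimpleImputer strategy, and StandardScaler options behave:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer, OneHotEncoder, StandardScaler

# Binarizer: values <= threshold become 0, values above it become 1.
X = np.array([[0.2, 1.5], [3.0, 0.4]])
print(Binarizer(threshold=1.0).fit_transform(X))
# [[0. 1.]
#  [1. 0.]]

# OneHotEncoder with handle_unknown='ignore': a category unseen at fit
# time encodes as an all-zero row instead of raising an error.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['cat'], ['dog']])
print(enc.transform([['bird']]).toarray())  # [[0. 0.]]

# SimpleImputer with strategy='mean': NaNs are replaced by the
# per-column mean computed at fit time.
imp = SimpleImputer(strategy='mean')
print(imp.fit_transform(np.array([[1.0, np.nan], [3.0, 4.0]])))
# [[1. 4.]
#  [3. 4.]]

# StandardScaler: remove the per-column mean and scale to unit variance.
print(StandardScaler().fit_transform(np.array([[1.0], [3.0]])))
# [[-1.]
#  [ 1.]]
```

Each transform is fitted on training data and can then be applied unchanged to new data, which is what allows the "handle unknown" and "add indicator" options to matter at transform time.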
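The PolynomialFeatures options (Degree, Include bias, Only interaction features produced) can likewise be illustrated with the scikit-learn class they correspond to; this is a sketch of the underlying behavior, not of the node's own API:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])

# degree=2 with the default include_bias=True produces the columns
# [1, x0, x1, x0^2, x0*x1, x1^2].
print(PolynomialFeatures(degree=2).fit_transform(X))
# [[1. 2. 3. 4. 6. 9.]]

# interaction_only=True keeps only products of distinct features
# (no pure powers such as x0^2); include_bias=False drops the
# column of ones.
print(PolynomialFeatures(degree=2, interaction_only=True,
                         include_bias=False).fit_transform(X))
# [[2. 3. 6.]]
```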